Introduction
The goal of this market segmentation analysis is to develop user “personas” to inform future marketing efforts and product development. The analysis uses an unsupervised machine learning approach to cluster users into distinct personas. Unsupervised clustering offers a data-driven way to infer structure within the data, and the resulting clusters can then be interpreted to understand how Duolingo users are similar to or different from each other. Because many of the survey features were collected as categorical variables, this analysis uses the k-modes clustering algorithm. Some numerical variables were converted into categories, as k-modes can only handle categorical variables. K-modes works by iteratively comparing each data point to k centroids (the cluster modes). Each point is assigned to the cluster whose mode it is most similar to, and the modes are then recalculated. The distance between a point and a cluster mode is measured by dissimilarity (the total number of mismatched attributes between them).
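The dissimilarity measure described above can be sketched in a few lines of base R. This is only an illustration: the function names (`matching_dissimilarity`, `column_mode`) and the toy modes and user are made up for the example and are not taken from the survey data or from the code used in this analysis.

```r
# Simple matching dissimilarity: the number of attributes on which two
# categorical records disagree.
matching_dissimilarity <- function(a, b) {
  sum(a != b)
}

# Toy cluster modes (illustrative values, not survey results).
modes <- data.frame(
  age = c("18-34", "55-74"),
  subscriber = c("no", "yes"),
  commitment = c("moderate", "very"),
  stringsAsFactors = FALSE
)

# A hypothetical user to assign to one of the two modes.
new_user <- c(age = "55-74", subscriber = "no", commitment = "very")

# Dissimilarity between the user and each mode, then the closest mode.
d <- apply(modes, 1, matching_dissimilarity, b = new_user)
nearest <- which.min(d)
print(d)
print(nearest)

# After assignment, a cluster's mode is the most frequent category in each
# column of its members (ties go to the alphabetically first category).
column_mode <- function(x) names(which.max(table(x)))
print(column_mode(c("yes", "no", "no")))
```

Here the user mismatches the first mode on two attributes and the second on one, so the user joins the second cluster.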
Data source
User survey and usage data from Duolingo. The survey asked users a series of questions about demographics (e.g., country, age, employment status), and motivation (e.g., primary reason for studying a language). The goal of this survey was to develop user segments (or personas) to inform future marketing efforts and product development. Usage data was collected from August 1, 2018 to November 5, 2018.
EDA Summary
EDA revealed that a large share (57%) of the daily goal values were missing (NA). Over 50% of users from four countries (MX, FR, JP, RU) have purchased a subscription. Interestingly, these four countries also have the lowest percentage of users in the 0-10,000 salary range and a high number of lessons completed. For other variables, such as commitment, employment status, and age, differences across these geographies are less clear. Additionally, we found that users who purchase a subscription also use the app more.
Results (3 Personas)
High Value Customer
- Most likely to purchase a subscription
- Very active with the Duolingo app
- Very committed to learning
- High proportion of retirees
- Generally older (55 – 74)
- Generally earn more
- Mixed language proficiency
New Language Students
- Least likely to purchase a subscription
- Generally younger (18-34)
- Most earn less than 10k
- Learning a language for the first time
- Highest probability of being a student or unemployed
Working Adult Reviewer
- Reviewing a language they have studied before
- Generally middle-aged (35 – 54)
- Generally earning $26k – $75k
- Most likely to take a placement test
- Highest employment rate
Recommendations for product changes and marketing campaigns
High Value Customer
- Consider developing a loyalty and referral program targeted at this group. Highlight the referral scheme, as word of mouth is the best way to win new customers.
- Have dedicated service representatives available if they have issues.
New Language Students
- A young group of new language learners. Consider campaigns that expose them to multiple new languages to help them discover one that interests them.
- Appeal to young people’s desire to experience new languages with travel-focused marketing campaigns.
Working Adult Reviewer
- Most likely to review an old language, so target notifications and marketing campaigns at relearning an old language.
- Most likely to be working a job, so consider sending notifications after working hours, when this group is most likely to be active on the app.
Key Visualizations
Data pre-processing.
The daily goal feature has more than 50% of its values missing (NA), so we remove it.
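A minimal sketch of this kind of missing-value check in base R is shown below. The data frame is a toy example with made-up values, and the column name `daily_goal` is an assumption about how the feature is named in the usage data.

```r
# Toy usage table (made-up values); `daily_goal` stands in for the sparse feature.
df <- data.frame(
  daily_goal = c(NA, 20, NA, NA, 10, NA),
  n_active_days = c(3, 10, 7, NA, 2, 5)
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(df))
print(na_frac)

# Keep only the columns with at most 50% missing values.
df_clean <- df[, na_frac <= 0.5, drop = FALSE]
print(names(df_clean))
```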

Data pre-processing.
What is the distribution of time spent completing the survey? We do not want inaccurate survey results, such as responses that were rushed. Based on a histogram of time spent on the survey, we remove users who did not spend at least 100 seconds (2 on the log10 scale) filling it out.
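The filter could look roughly like the following in base R. The data frame and the column names `user_id` and `time_spent_seconds` are hypothetical; only the 100-second threshold comes from the text above.

```r
# Toy survey timings (made-up values; column names are assumptions).
survey <- data.frame(
  user_id = 1:5,
  time_spent_seconds = c(35, 99, 100, 240, 1800)
)

# The histogram is drawn on a log10 scale, where 100 seconds corresponds to 2.
survey$log10_time <- log10(survey$time_spent_seconds)

# Keep only users who spent at least 100 seconds on the survey.
survey_kept <- survey[survey$time_spent_seconds >= 100, ]
print(survey_kept)
```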

Exploratory data analysis.
As expected, those who purchase a subscription are more active.

The heatmap shows that number of active days, lessons started, lessons completed, and highest crown count correlate with each other. Overall, most of these app usage features are positively correlated with each other.
# Check correlations. To include the logical variables purchased_subscription and took_placement_test, we convert them to 0/1 integers
df_usage$purchased_subscription <- as.integer(as.logical(df_usage$purchased_subscription))
df_usage$took_placement_test <- as.integer(as.logical(df_usage$took_placement_test))
df_corr <- df_usage[,c("highest_course_progress", "took_placement_test", "purchased_subscription", "highest_crown_count",
"n_active_days","n_lessons_started","n_lessons_completed","longest_streak","n_days_on_platform")]
df_corr <- na.omit(df_corr)
### Correlation matrix, rounded to two decimals
cormat <- round(x = cor(df_corr), digits = 2)
### Get lower triangle of the correlation matrix (defined for completeness, not used below)
get_lower_tri <- function(cormat) {
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
##                         highest_course_progress took_placement_test
## highest_course_progress                       1                0.18
## took_placement_test                          NA                1.00
## purchased_subscription                       NA                  NA
## highest_crown_count                          NA                  NA
## n_active_days                                NA                  NA
## n_lessons_started                            NA                  NA
## n_lessons_completed                          NA                  NA
## longest_streak                               NA                  NA
## n_days_on_platform                           NA                  NA
##                         purchased_subscription highest_crown_count
## highest_course_progress                   0.19                0.66
## took_placement_test                       0.07                0.19
## purchased_subscription                    1.00                0.29
## highest_crown_count                         NA                1.00
## n_active_days                               NA                  NA
## n_lessons_started                           NA                  NA
## n_lessons_completed                         NA                  NA
## longest_streak                              NA                  NA
## n_days_on_platform                          NA                  NA
##                         n_active_days n_lessons_started n_lessons_completed
## highest_course_progress          0.37              0.26                0.27
## took_placement_test              0.07              0.13                0.13
## purchased_subscription           0.36              0.33                0.33
## highest_crown_count              0.55              0.52                0.52
## n_active_days                    1.00              0.50                0.50
## n_lessons_started                  NA              1.00                0.98
## n_lessons_completed                NA                NA                1.00
## longest_streak                     NA                NA                  NA
## n_days_on_platform                 NA                NA                  NA
##                         longest_streak n_days_on_platform
## highest_course_progress           0.34               0.45
## took_placement_test               0.02              -0.07
## purchased_subscription            0.25               0.10
## highest_crown_count               0.51               0.35
## n_active_days                     0.47               0.17
## n_lessons_started                 0.27               0.01
## n_lessons_completed               0.27               0.01
## longest_streak                    1.00               0.28
## n_days_on_platform                  NA               1.00
### Melt the upper triangle into long format (melt() comes from reshape2)
library(reshape2)
library(ggplot2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1)) +
  coord_fixed()

Exploratory data analysis
MX, FR, JP and RU have extremely high subscription rates. Their users also share high levels of commitment to learning the language and similar age profiles (a relatively high proportion of 55-74 year olds). None of these four countries is natively English-speaking. Germany (DE) is another non-English-speaking country with similar age demographics and high levels of commitment but, despite this, has a low proportion of subscribers. DE could therefore be an untapped market. One strategy could be to lower the subscription cost in Germany to ease the point of entry.



German users include far fewer 151k+ earners than users in MX, FR, JP and RU. Consider lowering the subscription cost in this region.

Kmodes clustering
Most of the survey data is categorical, so we use k-modes to cluster it. The goal of this clustering is to identify distinct groups within the data set. First we build our data frame and then use an elbow plot to determine the optimal number of clusters, k. From the elbow plot we select k = 3.
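To make the elbow procedure concrete, here is a self-contained base-R sketch. It uses a deliberately simplified k-modes (deterministic start from the first k distinct rows), which is an assumption made for illustration and not the implementation used in this analysis. The helper names (`column_mode`, `mismatch_matrix`, `simple_kmodes`) and the toy data are all made up.

```r
# Most frequent category in a vector (ties go to the alphabetically first one).
column_mode <- function(x) names(which.max(table(x)))

# Mismatch counts: rows are records, columns are modes.
mismatch_matrix <- function(m, modes) {
  d <- sapply(seq_len(nrow(modes)), function(j) {
    rowSums(m != matrix(modes[j, ], nrow(m), ncol(m), byrow = TRUE))
  })
  matrix(d, nrow = nrow(m))
}

# Simplified k-modes: alternate between assigning records to the nearest
# mode and recomputing each cluster's mode until assignments stop changing.
simple_kmodes <- function(df, k, max_iter = 20) {
  m <- as.matrix(df)
  modes <- unique(m)[seq_len(k), , drop = FALSE]  # needs >= k distinct rows
  cluster <- rep(0L, nrow(m))
  for (iter in seq_len(max_iter)) {
    new_cluster <- apply(mismatch_matrix(m, modes), 1, which.min)
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    for (j in unique(cluster)) {
      modes[j, ] <- apply(m[cluster == j, , drop = FALSE], 2, column_mode)
    }
  }
  d <- mismatch_matrix(m, modes)
  cost <- sum(d[cbind(seq_len(nrow(m)), cluster)])
  list(cluster = cluster, modes = modes, cost = cost)
}

# Toy categorical data with three obvious groups plus a few variants.
toy <- data.frame(
  age = rep(c("18-34", "35-54", "55-74"), times = 4),
  income = c(rep(c("low", "mid", "mid"), 3), "low", "high", "high"),
  subscriber = c(rep(c("no", "no", "yes"), 3), "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Total within-cluster dissimilarity for k = 1..5.
costs <- sapply(1:5, function(k) simple_kmodes(toy, k)$cost)
print(costs)
```

Plotting `costs` against k (for example with `plot(1:5, costs, type = "b")`) gives the elbow plot. On this toy data the cost drops sharply up to k = 3 and flattens afterwards, which is the pattern used to pick the number of clusters.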
The modes of our clusters
##       age     annual_income  employment_status                 student
## 1 35 - 54 $26,000 - $75,000 Employed full-time Not currently a student
## 2   18-34      $0 - $10,000 Employed full-time Not currently a student
## 3 55 - 74 $26,000 - $75,000 Employed full-time Not currently a student
##                       duolingo_subscriber
## 1 No, I have never paid for Duolingo Plus
## 2 No, I have never paid for Duolingo Plus
## 3  Yes, I currently pay for Duolingo Plus
##                           primary_language_commitment
## 1 I'm moderately committed to learning this language.
## 2 I'm moderately committed to learning this language.
## 3       I'm very committed to learning this language.
##                                          primary_language_review
## 1  I am using Duolingo to review a language I've studied before.
## 2 I am using Duolingo to learn this language for the first time.
## 3 I am using Duolingo to learn this language for the first time.
##   primary_language_proficiency took_placement_test n_lessons_completed_cat
## 1                 Intermediate                   1                       2
## 2                     Beginner                   0                       1
## 3                     Beginner                   1                       3
##   purchased_subscription
## 1                      0
## 2                      0
## 3                      1
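The `n_lessons_completed_cat` column in the modes above is a binned version of a numeric count, since k-modes needs categorical inputs. One way such a conversion could be done is with `cut()`. The bin edges below are hypothetical, as the thresholds used in the analysis are not stated.

```r
# Toy lesson counts (made-up values).
n_lessons_completed <- c(0, 4, 12, 35, 80, 150)

# Hypothetical bin edges: (-Inf, 10], (10, 50], (50, Inf).
n_lessons_completed_cat <- cut(
  n_lessons_completed,
  breaks = c(-Inf, 10, 50, Inf),
  labels = c("1", "2", "3")
)
print(table(n_lessons_completed_cat))
```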
Main figure
Radar chart summarizes the key attributes of each cluster.
High Value Customer
- Most likely to purchase a subscription
- Very active with the Duolingo app
- Very committed to learning
- High proportion of retirees
- Generally older (55 – 74)
- Generally earn more
- Mixed language proficiency
New Language Students
- Least likely to purchase a subscription
- Generally younger (18-34)
- Most earn less than 10k
- Learning a language for the first time
- Highest probability of being a student or unemployed
Working Adult Reviewer
- Reviewing a language they have studied before
- Generally middle-aged (35 – 54)
- Generally earning $26k – $75k
- Most likely to take a placement test
- Highest employment rate
Subscription purchases by cluster
The High Value Customer cluster contains the users who are willing to pay for a subscription.
Commitment to learning a language by cluster
High Value Customers are more committed to learning a language.
Age breakdown by cluster
High Value Customers tend to be older, while New Language Students are younger.
Income breakdown by cluster
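Per-cluster breakdowns such as the subscription rate can be computed with `tapply()` once each user has a cluster label. The sketch below uses a made-up table, so the numbers illustrate the calculation only and are not results from this analysis.

```r
# Toy cluster assignments and subscription flags (made-up values).
users <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  purchased_subscription = c(0, 1, 0, 0, 0, 0, 1, 1, 1, 0)
)

# Share of subscribers within each cluster.
sub_rate <- tapply(users$purchased_subscription, users$cluster, mean)
print(round(sub_rate, 2))
```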

Reviewing or learning new language by cluster

3D plots with plotly show our 3 clusters by course progress, purchased_subscription, and active days.